# Multimodal Instruction Understanding
## PixelReasoner RL V1
TIGER-Lab · License: Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 112 · Likes: 3

PixelReasoner is a vision-language model based on Qwen2.5-VL-7B-Instruct, trained with curiosity-driven reinforcement learning and focused on image-text-to-text tasks.
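Most Transformers checkpoints in this listing expose the same image-text-to-text interface, so one loading pattern covers them. Below is a minimal inference sketch using recent Transformers; the repo id is inferred from the listing and may differ, and it assumes the checkpoint ships a Qwen2.5-VL-style processor and chat template.

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Hypothetical repo id inferred from the listing; check TIGER-Lab's page for the exact name.
MODEL_ID = "TIGER-Lab/PixelReasoner-RL-v1"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this image?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```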
## Jedi 7B 1080p
xlangai · License: Apache-2.0 · Image-to-Text · Safetensors · English · Downloads: 239 · Likes: 2

A multimodal model built on Qwen2.5-VL-7B-Instruct, supporting joint processing of images and text and suited to vision-language tasks.
## Llama 4 Scout 17B 16E Instruct FP8 Dynamic
RedHatAI · License: Other · Image-to-Text · Multilingual · Downloads: 5,812 · Likes: 8

A multilingual Llama 4 instruction model with 17B active parameters across 16 experts, optimized with dynamic FP8 quantization to significantly reduce resource requirements.
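RedHatAI's FP8-dynamic checkpoints are prepared for serving with vLLM (FP8 weights with dynamic per-token activation scaling). A minimal serving sketch follows; the repo id is modeled on the listing name and the GPU count is an assumption, since a 16-expert MoE typically needs multiple FP8-capable GPUs even when quantized.

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id modeled on the listing name; verify on the RedHatAI page.
llm = LLM(
    model="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic",
    tensor_parallel_size=4,  # assumption: split the MoE across four GPUs
    max_model_len=8192,
)

messages = [{"role": "user", "content": "Summarize the trade-offs of FP8 inference."}]
out = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=256))
print(out[0].outputs[0].text)
```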
## Qwen2.5 VL 32B Instruct GGUF
DevQuasar · Image-to-Text · Downloads: 27.50k · Likes: 1

A GGUF conversion of Qwen2.5-VL-32B-Instruct, a 32B-parameter multimodal vision-language model that supports joint understanding and generation over images and text.
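GGUF builds run under llama.cpp and its bindings. A text-only sketch with llama-cpp-python is below; the file name and quantization level are illustrative, and image input additionally requires the matching multimodal projector (mmproj) file that some GGUF repos ship alongside the weights.

```python
from llama_cpp import Llama

# Illustrative file name; use the quantization you actually downloaded (e.g. Q4_K_M).
llm = Llama(
    model_path="Qwen2.5-VL-32B-Instruct.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload every layer to the GPU if VRAM allows
)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what is a vision-language model?"}],
    max_tokens=64,
)
print(resp["choices"][0]["message"]["content"])
```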
## Qwen2.5 VL 32B Instruct W4A16 G128
leon-se · License: Apache-2.0 · Image-to-Text · Downloads: 16 · Likes: 2

A quantized build of Qwen2.5-VL-32B-Instruct, a 32B-parameter multimodal large language model supporting vision and language tasks and suited to complex multimodal interaction scenarios.
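The W4A16-G128 suffix conventionally denotes 4-bit weights, 16-bit activations, and a quantization group size of 128, a format recent vLLM can load directly (the scheme is read from the checkpoint config). A sketch with a hypothetical repo id and image URL:

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id based on the listing name; vLLM detects the quantization scheme.
llm = LLM(model="leon-se/Qwen2.5-VL-32B-Instruct-W4A16-G128", max_model_len=4096)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
        {"type": "text", "text": "Describe this diagram."},
    ],
}]
out = llm.chat(messages, SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```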
## Qwen2 VL 7B Visual RFT LISA IoU Reward
Zery · License: Apache-2.0 · Image-to-Text · Safetensors · English · Downloads: 726 · Likes: 4

A vision-language model built on Qwen2-VL-7B-Instruct, supporting multimodal image and text input and suitable for a range of vision-language tasks.
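The "IoU Reward" in the name points to a reinforcement fine-tuning setup where the reward signal is the overlap between a predicted and a ground-truth bounding box. A minimal sketch of that reward (an illustration of the general idea, not the authors' code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_reward(predicted_box, ground_truth_box):
    # Reward the policy in proportion to localization quality.
    return iou(predicted_box, ground_truth_box)

print(iou_reward((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```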
## Qwen 2 VL 7B OCR
Swapnik · License: Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 103 · Likes: 1

A fine-tuned version of the Qwen2-VL-7B model, trained with Unsloth and Hugging Face's TRL library for a 2x training speedup.
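Inference for an OCR-tuned Qwen2-VL follows the stock Qwen2-VL pattern with a transcription prompt. The repo id below is a guess from the listing:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Swapnik/Qwen2-VL-7B-OCR"  # hypothetical id inferred from the listing

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("scanned_page.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Transcribe all visible text."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = ids[0][inputs["input_ids"].shape[1]:]  # drop the echoed prompt
print(processor.decode(new_tokens, skip_special_tokens=True))
```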
## Llama 3.2 11B Vision OCR
Swapnik · License: Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 80 · Likes: 1

A 4-bit quantized build of the Llama 3.2 11B Vision Instruct model, optimized with Unsloth for roughly 2x faster training.
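Unsloth-prepared 4-bit vision checkpoints load through its FastVisionModel wrapper. A sketch under the assumption that the repo id matches the listing; note the returned tokenizer doubles as the image processor in Unsloth's vision API:

```python
from PIL import Image
from unsloth import FastVisionModel

# Hypothetical repo id inferred from the listing.
model, tokenizer = FastVisionModel.from_pretrained(
    "Swapnik/Llama-3.2-11B-Vision-OCR",
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch kernels/adapters to inference mode

image = Image.open("invoice.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the text from this document."},
]}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```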
## Llama 3.2 11B Vision Electrical Components Instruct
ankitelastiq · License: MIT · Image-to-Text · English · Downloads: 22 · Likes: 1

A multimodal model combining vision and language, based on Llama 3.2 11B Vision Instruct and supporting image-to-text tasks.
## Qwen2.5 VL 7B Instruct 4bit
jarvisvasu · License: Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 180 · Likes: 1

A multimodal model fine-tuned from Qwen2.5-VL-7B-Instruct, trained with the Unsloth acceleration framework and the TRL library for a 2x training speedup.
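Pre-quantized bitsandbytes repos carry their 4-bit config inside the checkpoint, so recent Transformers loads them directly; the same config also quantizes a full-precision checkpoint on the fly. A sketch with an assumed repo id:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "jarvisvasu/Qwen2.5-VL-7B-Instruct-4bit"  # hypothetical id from the listing

# Needed only when quantizing a full-precision checkpoint at load time;
# repos that already ship bitsandbytes 4-bit weights embed this config.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, quantization_config=bnb, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```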
## Pixtral Large Instruct 2411
nintwentydo · License: Other · Image-to-Text · Transformers · Multilingual · Downloads: 23 · Likes: 2

A multimodal instruction-tuned model based on Mistral AI's Pixtral Large, supporting image and text input with multilingual processing capabilities.
## Qwen2 VL 7B Instruct GGUF
gaianet · License: Apache-2.0 · Image-to-Text · English · Downloads: 102 · Likes: 2

A GGUF build of Qwen2-VL-7B-Instruct, a 7B-parameter multimodal model supporting image-text interaction tasks.
## Qwen2 VL 7B Instruct Onnx
pdufour · License: Apache-2.0 · Image-to-Text · Transformers · Downloads: 47 · Likes: 4

An ONNX export of a 7B-parameter vision-language model based on the Qwen2-VL architecture, supporting image understanding and instruction-following interaction.
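ONNX exports of decoder-style VLMs are usually split into several graphs (vision encoder, embedder, decoder) that must be run in sequence; the file name below is illustrative, and the repo's README is the authority on the actual graph files and run order. Inspecting a session's inputs is a reasonable first step:

```python
import onnxruntime as ort

# Illustrative file name; substitute the graph file shipped in the repo.
session = ort.InferenceSession(
    "Qwen2-VL-7B-Instruct-decoder.onnx",
    providers=["CPUExecutionProvider"],
)
for tensor in session.get_inputs():
    print(tensor.name, tensor.shape, tensor.type)
```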
## OpenVLA 7B Finetuned LIBERO-10
openvla · License: MIT · Image-to-Text · Transformers · English · Downloads: 1,779 · Likes: 2

A vision-language-action model obtained by fine-tuning OpenVLA 7B with LoRA on the LIBERO-10 dataset, suited to robotics applications.
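OpenVLA checkpoints follow the pattern in the OpenVLA README: the processor wraps a task instruction in a prompt, and a custom predict_action head (loaded via trust_remote_code) decodes a continuous robot action. The exact repo id and un-normalization key below are assumptions:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "openvla/openvla-7b-finetuned-libero-10"  # hypothetical exact id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")

image = Image.open("third_person_camera.png")
prompt = "In: What action should the robot take to pick up the black bowl?\nOut:"
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)

# predict_action decodes a 7-DoF action (xyz delta, rotation delta, gripper);
# unnorm_key selects the dataset statistics used to un-normalize it ("libero_10" assumed).
action = vla.predict_action(**inputs, unnorm_key="libero_10", do_sample=False)
print(action)
```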
## OpenVLA 7B Finetuned LIBERO-Goal
openvla · License: MIT · Image-to-Text · Transformers · English · Downloads: 746 · Likes: 1

An OpenVLA 7B vision-language-action model fine-tuned with LoRA on the LIBERO-Goal dataset, suited to robotics applications; usage mirrors the LIBERO-10 sketch above with a LIBERO-Goal un-normalization key.